Motivation

Here is my attempt to tell you the story I found in the data. Where is the highest percentage of attrition? How does overtime relate to attrition? Does gender really matter? You will find answers to these questions and many more if you sit down and spend some of your time reading my report)

Prepare data for exploratory data analysis (EDA)

So, when I begin to uncover the story in a dataset, I usually spend many hours preparing and cleaning it, but not this time! Today I have a nice, well-prepared dataset, and I only need to do a bit of magic to make the visualizations better.

Ordinal data

Let’s begin with the ordinal data, just because I want to :). Let’s see how many attrition cases we have in each category of each parameter.

So what do we see here? We learn that most employees travel rarely, work in the R&D department, have a bachelor’s or master’s degree, and so on. But there are some more interesting facts. For example, why do we see only two categories of PerformanceRating? Is it really true that all IBM employees do their job so well? We also see that an unusually large number of people in the OverTime category quit their jobs. But to tell you the truth, it’s a little difficult to read the picture because of the imbalanced classes… Let’s invite our old friend, the percentage, to this party!
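Such per-category attrition percentages can be computed with a short snippet like the following. This is a hedged sketch using dplyr: `hr` is a placeholder name for the IBM HR data frame, with `Attrition` coded as "Yes"/"No".

```r
# Sketch: share of leavers within each level of a categorical column,
# e.g. BusinessTravel. `hr` is an assumed name for the dataset.
library(dplyr)

hr %>%
  group_by(BusinessTravel) %>%
  summarise(
    n         = n(),
    attrition = mean(Attrition == "Yes")  # fraction of leavers in the group
  ) %>%
  arrange(desc(attrition))
```

The same pattern applies to any of the ordinal columns; normalizing within each group is what removes the class-imbalance distortion seen in the raw counts.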

Now we can draw the perfectly logical conclusion that people with frequent business trips, low JobInvolvement, low Environment, Job, and Relationship satisfaction, bad WorkLifeBalance, and zero StockOptionLevel leave the company more often than the others. One could say this is a rather trivial conclusion, but I have to prove it with data, so I do. What about more specific insights? I have them for you) Look carefully at the JobRole category. Something is wrong with Sales Representatives. Maybe they have a terrible director, or maybe they have to work overtime too often. By the way, look at the OverTime section. It is strange that people who work overtime leave the company more often than others. That may mean that people work hard but don’t feel any feedback from the company. Now let’s see the top 5 categories with the biggest attrition.

Here we see that the category with the highest attrition is JobRole: Sales Representative, with 33 people who left the company vs 50 who stayed. A surprising result is that people with a high (24%) PercentSalaryHike leave the company, but the small number of data points makes me think this is just a fluctuation and nothing more; still, we will keep this fact in mind. Now let’s dive a little deeper into the details. Let’s see how attrition connects with OverTime and some other parameters.

In each picture the x-axis shows YearsAtCompany, for different parameters and different Attrition values (Attrition = No on the left, Attrition = Yes on the right). In the first row the two pictures look almost logical, except that people who don’t work overtime get, on average, a higher salary. In the second row we see that people who work harder get, on average, a smaller PercentSalaryHike than those who don’t work overtime. Maybe this is an averaging effect, maybe poor work by the HR department. In the last row we see another interesting fact: people who work hard and leave the company get promoted less often than people who stay. Of course, this may just be an averaging effect, so the CEO of IBM shouldn’t dismiss the whole HR department :).

Now let’s look at the attrition problem from the other side: the continuous data. What do we see now? There are some interesting things: after about five years of work in the current role or with the current manager, attrition increases. There are also some trivial conclusions, such as that older people don’t like to change jobs, so the attrition rate is smaller among older people. If you like more digits and pairwise correlations, scroll to the end of the report and you will find a picture with a lot of this stuff)

So, we have looked at the dataset a lot, and now we are ready for some…

Modeling

Let’s begin with a simple and powerful method: logistic regression. I don’t want to scare you with formulas and other mathematical stuff; just think of it as a black magic box that can give you the probability of a certain class for a test data point. First of all, we need to prepare the data for modeling. To do so, I will use the recipes package.
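The preparation step with recipes might look roughly like this. This is a sketch, not the exact recipe used for the reported model: the `train`/`test` split names and the specific steps are assumptions.

```r
# Sketch of data preparation with the recipes package.
# `train` and `test` are assumed data-frame splits of the IBM HR data.
library(recipes)

rec <- recipe(Attrition ~ ., data = train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  # one-hot encode factor columns
  step_center(all_numeric()) %>%                  # center numeric predictors
  step_scale(all_numeric()) %>%                   # and scale them to unit variance
  prep(training = train)

train_baked <- bake(rec, new_data = train)  # model-ready training data
test_baked  <- bake(rec, new_data = test)   # same transformations applied to test
```

The key point is that the recipe is `prep`-ed on the training data only, so the centering and scaling statistics never leak information from the test set.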

Let’s look at some metrics. Think of AUC as “bigger is better”, with a maximum value of 1.

## Calculated AUC = 0.71
## Table of correct and wrong classification
##       Predicted
## Actual          0          1
##      0 0.94267516 0.05732484
##      1 0.59259259 0.40740741
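For the curious, metrics like these can be computed with a few lines. This is a hedged sketch assuming pROC is available, with `probs` as the model’s predicted probabilities on the test set and `actual` as the true 0/1 labels (both assumed names).

```r
# Sketch: AUC and a row-normalized confusion table.
# `probs` and `actual` are assumed objects from the fitted model.
library(pROC)

auc_value <- auc(roc(actual, probs))     # area under the ROC curve

pred_class <- ifelse(probs > 0.5, 1, 0)  # classify at a 0.5 threshold
prop.table(table(Actual = actual, Predicted = pred_class), margin = 1)
```

`margin = 1` normalizes each row, which is why each row of the table above sums to 1: the diagonal entries are the per-class recall.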

Let’s now look at the variable importance table.

##                             Feature Importance
## 1                      OverTime_Yes     100.00
## 2       EnvironmentSatisfaction_Low      56.42
## 3                  DistanceFromHome      52.23
## 4                NumCompaniesWorked      51.91
## 5  BusinessTravel_Travel_Frequently      46.52
## 6            WorkLifeBalance_Better      42.92
## 7          TrainingTimesLastYear_X3      39.49
## 8                JobInvolvement_Low      39.25
## 9      RelationshipSatisfaction_Low      38.78
## 10         TrainingTimesLastYear_X5      38.14

So, as we saw in the EDA, OverTime is a very important feature, and EnvironmentSatisfaction_Low is more important than JobInvolvement_Low and RelationshipSatisfaction_Low.

Not bad for a base model. Can we do better? Of course we can. Let’s invite XGBoost.
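A rough sketch of how the XGBoost fit might look is below. The hyperparameter values are illustrative assumptions, not the ones behind the reported numbers, and `train_baked` is the assumed output of the recipes step.

```r
# Sketch of a binary-classification XGBoost fit on the baked training data.
library(xgboost)

features <- setdiff(names(train_baked), "Attrition")

dtrain <- xgb.DMatrix(
  data  = as.matrix(train_baked[, features]),
  label = as.numeric(train_baked$Attrition == "Yes")  # 1 = left the company
)

xgb_fit <- xgb.train(
  params  = list(objective = "binary:logistic",  # outputs probabilities
                 eta = 0.1, max_depth = 4),      # illustrative values
  data    = dtrain,
  nrounds = 100
)

xgb.importance(model = xgb_fit)  # gain-based feature importance table
```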

Look at these metrics:

## Calculated AUC = 0.68
## Table of correct and wrong classification
##       Predicted
## Actual         0          1
##      0 0.9426752 0.05732484
##      1 0.5925926  0.4074074

Hmmm… What?? Usually XGBoost classifies better than logistic regression, but not today. Let’s look at the variable importance again and compare it with the previous result.

## # A tibble: 10 x 2
##    Feature            Importance
##    <chr>                   <dbl>
##  1 MonthlyIncome          0.120 
##  2 OverTime_Yes           0.100 
##  3 DailyRate              0.0700
##  4 Age                    0.0600
##  5 DistanceFromHome       0.0600
##  6 TotalWorkingYears      0.0500
##  7 MonthlyRate            0.0500
##  8 NumCompaniesWorked     0.0400
##  9 YearsAtCompany         0.0400
## 10 HourlyRate             0.0300

Again, OverTime is at the top of the most important features. So, if you see that your researcher is working after eight p.m. again, don’t listen to him, just send him home :)

And now it’s time for the heavy weapons: time to load H2O.
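Starting a local H2O cluster from R takes just two lines; this is a sketch, and the thread count here simply mirrors the “allowed cores: 3” line in the connection report below.

```r
# Sketch: start (or connect to) a local H2O cluster from R.
library(h2o)

h2o.init(nthreads = 3)  # limit H2O to 3 cores; use -1 for all available
```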

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         14 hours 16 minutes 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    3 months and 24 days !!! 
##     H2O cluster name:           H2O_started_from_R_alex_wto371 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.53 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  3 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)

The connection is successful, great. Let’s begin with logistic regression again, but this time with a grid search for the best-fitting parameters.
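An H2O GLM grid search can be sketched as follows. The grid values, frame names (`train_h2o`, `test_h2o`), and the `predictors` vector are illustrative assumptions, not the exact configuration behind the reported model.

```r
# Sketch: grid search over elastic-net parameters for an H2O binomial GLM.
# `train_h2o` / `test_h2o` are assumed H2OFrames, `predictors` a vector
# of feature column names.
library(h2o)

grid <- h2o.grid(
  algorithm      = "glm",
  x              = predictors,
  y              = "Attrition",
  training_frame = train_h2o,
  family         = "binomial",
  hyper_params   = list(
    alpha  = c(0, 0.5, 1),          # 0 = ridge, 1 = lasso
    lambda = c(1e-4, 1e-3, 1e-2)    # regularization strength
  )
)

# Pick the model with the best AUC and score it on the test frame.
sorted <- h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)
best   <- h2o.getModel(sorted@model_ids[[1]])
h2o.auc(h2o.performance(best, newdata = test_h2o))
```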

To tell you the truth, I love H2O. It is a powerful and comfortable framework for ML. Let’s look at some metrics.

## Calculated AUC = 0.895
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.343925207847289:
##          0  1    Error     Rate
## 0      170 13 0.071038  =13/183
## 1       13 27 0.325000   =13/40
## Totals 183 40 0.116592  =26/223

Wow!! A very nice AUC on the test dataset! Now it’s time to look at the variable importance:

## # A tibble: 10 x 3
##    Feature                          Importance sign 
##    <chr>                                 <dbl> <chr>
##  1 OverTime_Yes                          0.716 POS  
##  2 JobLevel_X2                           0.456 NEG  
##  3 EnvironmentSatisfaction_Low           0.335 POS  
##  4 NumCompaniesWorked                    0.288 POS  
##  5 BusinessTravel_Travel_Frequently      0.284 POS  
##  6 WorkLifeBalance_Better                0.284 NEG  
##  7 JobSatisfaction_Very.High             0.267 NEG  
##  8 StockOptionLevel_X1                   0.241 NEG  
##  9 StockOptionLevel_X2                   0.236 NEG  
## 10 YearsSinceLastPromotion               0.236 POS

So, now it’s time to draw some general conclusions. First of all, do something about the Sales Representatives: they leave the company too often. Secondly, keep a careful eye on the workaholics who work overtime; maybe you don’t promote them or raise their salary as well as they deserve. Thirdly, don’t forget about the workers who travel frequently: they tend to get tired and leave the company, so it would probably be a good idea to give them a day of vacation after a business trip. And of course, don’t forget to work on such trivial but important things as Environment Satisfaction and Job Satisfaction.

P.S. The promised picture with a lot of pairwise correlations.

A picture with as much information about the numerical data as I could fit into one box.